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Abstract 

We report a novel algorithm, iBLUP, to impute missing genotypes by simultaneously and comprehensively using identity by 
descent and linkage disequilibrium information. The simulation studies showed that the algorithm exhibited drastically 
tolerance to high missing rate, especially for rare variants than other common imputation methods, e.g. BEAGLE and 
fastPHASE. At a missing rate of 70%, the accuracy of BEAGLE and fastPHASE dropped to 0.82 and 0.74 respectively while 
iBLUP retained an accuracy of 0.95. For minor allele, the accuracy of BEAGLE and fastPHASE decreased to —0.1 and 0.03, 
while iBLUP still had an accuracy of 0.61 .We implemented the algorithm in a publicly available software package also named 
iBLUP. The application of iBLUP for processing real sequencing data in an outbred pig population was demonstrated. 
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Introduction 

Benefited from the advances of sequencing technologies, 
Genome- Wide Association Studies (GWAS) have revealed sub- 
stantial genetic loci controlling human diseases and agriculturally 
important traits [1-3]. However, the identified loci collectively 
explain only a small proportion of total variation [4-7]. In 
addition to the path of common diseases and common variants, 
the new path of common disease and rare variants shed a new 
hope to have a better understand of complex traits [8]. 
Multiplexing is one the advances that revolutionized the high 
throughput Genotyping By Sequencing (GBS). Samples are 
individually tagged and pooled into a single lane of flow cell. It 
exponentially increases the number of samples analyzed in a single 
run without dramatically increasing cost and time [9] . 

Recently, several GBS methods used for both inbred and 
outbred population have been developed [10, 1 1]. The challenge is 
that the sequencing data contains a lot of missing genotypes. 
Imputation of missing genotypes at high missing rate is hard and 
imputation for rare variants are extreme hard, especially for 
general outbred populations because of the high degree of 
heterogeneity and phase ambiguity in the haplotype [12]. 

The information for imputation can be divided into two 
categories. One is linkage disequilibrium (LD) among genetic loci; 
the other is the relationship, termed identity by decent (IBD), 
among individuals [13]. Imputation methods have been developed 
to use either of them, or both with different degrees of complexity. 
These methods include allele frequencies based methods(PLINK, 



SNPMSTAT, UNPHASED, and TUNA), Hidden Markov Chain 
based methods(IMPUT, MACH, fastPHASE), mixed model based 
methods (S-MM,M-MM) and graphic theory based method(BEA- 
GLE) [14—22]. A clear linkage phase, such as a haplotype, is the 
most desirable situation for most of the algorithms to work with 
[13]. However, phasing becomes extremely difficult with GBS at 
low coverage with high missing rate, especially in outbred 
population such as human, maize landrace, dog, cattle and pig 
where heterogeneity is high [23]. As we known, none of the 
existing methods can work well for the GBS data with low 
coverage and high missing rate in outbred population and no 
convenient software can impute the missing genotypes based on 
this kind of data. The objective of this study was to make full usage 
of LD and IBD simultaneously and develop a genotype imputation 
algorithm and software with tolerance to high missing rate, 
especially for rare variants. 

Methods 

Approval by the Institutional Animal Care and Use Committee 
of Shanghai Jiao Tong University (contract no. 2011-0033) was 
given for all experimental procedures involving pigs in the present 
study. AH the 72 sequenced pigs were housed in Shanghai 
Xiangxin Livestock Ltd. Co., Shanghai, China, and were raised 
according to the standard practice for housing and care of 
Xiangxin Livestock Ltd. Co. (http://www.shxxgx.com/sygl.htm). 
Additional information of the sequenced pigs was shown in Table 
SI. 
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1 iBLUP method 

The chromosomes were divided into a large number of blocks 
on the basis of the extent of LD, and the LD of any two markers in 
a block is necessarily greater than some criteria. The marker that is 
less than some criteria will be removed from the block and will be 
imputed by a single variable BLUP model. All the markers in one 
block were analyzed by modeling using multivariate BLUP, and 
missing genotypes were predicted simultaneously. The imputation 
model for each marker in the block was: 




yi=X t bt + Ziai + e i (1) 

Where j, is an observed genotype vector for the number of copies 
of the minor allele (0, 1 or 2) for the marker, The length of the 
vector equals the number of individuals, A, is the fixed effect and X, 
is the design matrix for b„ a, is the effect underlying the observed 
genotype, Z, is the design matrix for a, and is the residual. The 
vector bi and a, have the same length asji;. 

Assumed that there are m markers in one block, i from 1 to m, 
we set 



Figure I.The mechanism and performance of iBLUP. The figure 
illustrates how observed genotype with missing value is imputed by 
Best Linear Unbiased Prediction (BLUP). The imputation uses both 
relationship among markers represented by Linkage Disequilibrium 
(LD), and relationship among individuals represented as Identity By 
Decent (IBD).G and K are the genetic variance-covariance matrix and 
marker-based kinship matrix respectively, and the symbol ® represents 
the Kronecker product. 
doi:1 0.1 371/journal.pone.01 01 025.g001 

data described in a later section. The symbol ® represents the 
Kronecker product. 
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The multivariate BLUP equations are: 
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where R is the residual variance-covariance matrix. Following 
Gengler et al. (2007), we set 



R = I ff 2 ff 2 =ff 2 = ... CT 2 = o.01,cov(e / ,e,)=0. 



G is the genetic variance-covariance matrix 



a \ = a \=--- a l„ =0.99,cov(fl, ) fl / ) = r, ;/t 7^, 

where r t j is the correlation between markers i and j. When the 
value of r tJ was >0.95 (or <— 0.95), r 5 was set to 0.95 (or —0.95) to 
avoid the singularities matrix [22]. 



K- 



k=\ 



is a marker-based kinship matrix [24] and we develop an iterative 
kinship algorithm to construct the matrix considering the missing 



2 Iterative kinship algorithm 

In the model, K was calculated using the iterative algorithm 
because of the high missing rate of the genomic genotype data. 
The initial K was calculated using only the genotype data from 
genotyped individuals, and homogenized after dividing by the 
number of common typed loci. The following K was calculated 
using the imputed genotype based on BLUP. When the difference 
between the correlation coefficients of two continuous K values is 
<0.001, the iterative process is finished. To ensure that K 
converged, we forced the imputed genotype of multivariate BLUP 
to be 0 (or 2), if it is <0 (or >2). To improve the computation 
speed, the Equation (2) was solved with an LU-factored method 
based on subroutines from the Intel R Math Kernel Library using 
parallel execution on Linux workstations. 

3 iBLUP pipeline 

The iBLUP pipeline can deal with both SNP array and 
sequencing data. The analysis steps are dependent on the kind of 
input data. If SNP array data is the input data, only step 5 need be 
performed. If user wants to impute sequencing data directly, all the 
5 steps can be run automatically. We introduce the analysis steps 
briefly as follow, and the details can be found in the online manual 
(http://klab.sjtu.edu.cn/iBLUP/). 

Step 1. To assign raw sequencing reads to individuals. 

Step 2. To filter reads on quality. 

Step 3. To map qualified reads to reference genome. The 
qualified reads are clustered by aligning reads with the reference 
genome using BWA [25] by the following steps. We first mapped 
all filtered reads to the reference panel and then attempted to 
divide remaining single reads into two or three shorter reads 
according to the sequences of enzyme cutting sites to align them 
individually with the reference genome because of uncertain 
ligation, such as adapter-DNA-DNA-adapter. We then used the 
sliding window method to query the remaining reads to ensure 
that we can make use of the incomplete reference panel. The rule 
of the sliding window is that the selected 25 uninterrupted bases 
from the first base at the 5 ' end of a read was aligned with the 
reference genome and a single base was added at each alignment 
until the maximum aligned sequence was reached. If the first 
25 bp at the 5' end were not aligned successfully, the next 25 bp, 
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i.e. base pairs 2-26, were aligned with the reference genome and 
so on. 

Step 4. To call genotype for each marker and individual. Reads 
that aligned with the reference panel were stored as "sam" files. 
Our iBLUP applied SAMtools to discover SNPs [26]. 

Step 5. To impute missing genotypes by iBLUP. 

4 Simulation data 

There were 3220 individuals genotyped on 9990 markers in the 
15 th QTL-MAS workshop [27]. The 9,990 SNP markers were 
evenly distributed on 5 chromosomes. Each chromosome had a 
size of 1 Morgan and carried 1998 SNPs equally distributed (1 
SNP every 0.05 cM). The 3220 individuals were from two 
generations, of which 220 individuals (200 females and 20 males) 
are parents, and the remaining .3000 individuals are offspring to be 
divided into 200 full-sib families consisting of 1 5 progeny per dam, 
which were generated by regular cross-hybridization of male and 
female parents (See Figure Sla). All the genotype of 9,990 SNPs 
on the 3220 individuals are known. Subset of individuals were 
sampled from the workshop data under two sampling schema: 1) 
Half sib schema. The sampled data included all the parents (20 
males and 200 females) and two progeny selected randomly from 
each full-sib family. This schema sampled more families with 
smaller family size (See Figure Sib). 2) Full sib schema. The 
sample data included 5 males, the corresponding mates and eleven 
progeny from each full sib family. This schema sampled fewer 
families with larger family size (See Figure Sic). A subset of 
markers were sample from the entire genetic markers (9,990) to 
investigate the effect of marker density. One of marker was 
selected for every five adjacent five markers. The sampled marker 
data set contained 1998 markers. The known genotype data were 
randomly masked as missing data at specific proportions. The 
proportions were ranged from 10% to 80%. Accuracy was 
calculated as Pearson correlation coefficient between known 
genotype and imputed. 

5 Real sequencing data 

The data were generated from an Illumina High-seq 2000 
sequencer. A flow-cell lane was used to sequence 72 pigs (36 
Landrace pigs and 36 Large White pigs) by using a DNA 
barcoding and genome reducing protocol (http:/ /klab. sjtu.edu. 
cn/GGRS/). There were 380,971,530 raw reads. The number of 
reads per individual ranged from 1,570,923 to 10,077,526 and the 
average was 5,022,387. 

Results 

We were motivated to develop a non-phasing algorithm [28] in 
5 a multivariate mixed model (M-MM) [22]. To take full advantage 
of a M-MM to fully incorporate both LD and IBD simultaneously, 
we made two major changes to enhance the representations of 
marker IBD information on the relationship matrix (K) among 
individuals, and marker LD information on the covariance matrix 
(G) of underlying variables (See Figure 1). First, we replaced 
overall IBD derived from the pedigree by the IBD derived from 
the markers. We developed an iterative algorithm to derive a 
robust IBD to situations with missing genotypes. Second, instead 
of using an arbitrary fixed size of LD block (e.g. two mega base 
pair), we implemented an optimization process to determine the 
LD threshold that to determine a variable size of LD block. The 
value of LD was represented as the squared correlation coefficient 
(r 2 ) calculated for the markers on the LD block. Our improved 
method of imputation by Best Linear Unbiased Prediction 
(iBLUP) had markedly higher accuracy than the conventional 
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Figure 2. Diagram of iBLUP pipeline. (1) Blended raw data were generated from the same flow cell lane. (2) Raw data were assigned to 
individuals according to the barcode. (3) Assigned reads were filtered for high quality reads according to several rules, including trimming the 
barcode and the last low quality base etc. (4) Filtered reads were aligned with the reference sequence. (5) SNP calling and genotyping were done 
according to the mapping results. (6) Missing genotypes were imputed by the iBLUP algorithm. 
doi:1 0.1 371 /journal.pone.01 01 025.g002 



M-MM method, even higher than the sophisticated graphic 
phasing method (BEAGLE), especially for situations with high 
missing rate. 

The performance of iBLUP was compared to three other types 
of commonly used methods, M-MM, BEAGLE and fastPHASE 
on a data set from 15 th QTLMAS. The iBLUP method 
outperformed over M-MM at all range of missing rates. When 
missing genotypes were below 40%, iBLUP had similar accuracy 
to BEAGLE and fastPHASE. With higher missing rates, iBLUP 
markedly outperformed BEAGLE and fastPHASE. At a missing 
rate of 50%, the accuracy of fastPHASE dropped to 0.79 while 
iBLUP retained an accuracy of 0.98. At a missing rate of 70%, the 
accuracy of BEAGLE fell to 0.82 while iBLUP held an accuracy of 
0.95 (Table 1). 

It is critical to dissect overall accuracy across all genotypes into 
major and minor allele genotypes. The major genotypes can be 
accurately imputed for rare variants if the accuracy of minor allele 
is ignored. The iBLUP method is superior to BEAGLE and 
fastPHASE not only on overall accuracy, but also for minor allele 
genotypes. When missing rate was 60%, the accuracy of 
fastPHASE dropped to -0.01, iBLUP kept an accuracy of 0.72 
for minor allele genotypes. At missing rate of 70%, the accuracy of 
BEAGLE dropped to —0.1, while iBLUP still retained an 
accuracy of 0.61 for minor allele genotypes (Table 2). 

We expanded our examination over a variety of circumstances. 
First we examined the responses of imputation accuracy to the 
level of kinship among individuals. Two subsets of the data from 
the QTLMAS 15th dataset were used for the examination. The 
two datasets contained all the available markers with the average 

Table 4. Imputation accuracy of real pig sequencing data. 



LD value of 0. 137, but varied on family structure. The first dataset 
consisted of a family structure of two full-sib individuals sampled 
from each family and the second dataset consisted of a family 
structure of parents and their eleven progeny. The average kinship 
coefficients were 0.0073 and 0.048 for the first and second family 
structures, respectively. In both cases, iBLUP had better imputa- 
tion accuracy than BEAGLE and fastPHASE at missing rates 
ranged from 60% to 80% (Table 3). 

Second, we examined the effect of markers density on 
imputation accuracy. The half sib family structure described 
above was used with two set of markers. One set contained all the 
available markers (9990 SNPs) with average LD of 0.137 and the 
other contained one fifth markers (1998 SNPs) with average LD of 
0.092.Compared to BEAGLE and fastPHASE, iBLUP performed 
higher imputation accuracy at missing rate ranged from 60% to 
80% in both cases (Table 3). 

We implemented the iBLUP algorithm in a publicly available 
pipeline also named iBLUP. The imputation step can be used 
independently for any genotype data, including the ones from 
DNA chips. The imputation step can also be used for raw 
sequencing data after four prior steps in iBLUP pipeline 
(Figure 2). 

The iBLUP pipeline provides users the option to optimize the 
LD threshold to determine the extent of the LD block. We 
examined the imputation accuracy with LD thresholds of 0.05, 0.1 
and 0.2 on the QTLMAS 15th dataset at missing rate of 30% 
(Figure S2). The analysis showed that an LD threshold of 0.1 
would achieve the highest imputation accuracy. Interestingly, this 
threshold was also observed as the optimum value of LD threshold 



Method 36% 50% 60% 70% 80% 

iBLIUP 0.97 0.97 0.96 0.96 0.95 

BEAGLE 0.92 0.92 0.91 0.91 0.91 

doi:1 0.1 371/journal.pone.0101 025.t004 
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for the pig sequencing data. This observation might help narrow 
the optimization range for LD to reduce computing cost in other 
experiments. 

We applied the iBLUP pipeline to sequencing data from a pig 
outbreed population for high-density SNP discovery and geno- 
typing. The sequences were collected in one lane of a single flow- 
cell at 72-plex by a genome reducing and sequencing protocol 
(http://klab.sjtu.edu.cn/GGRS/). There were 36% of missing 
data among 403,928 SNPs called. The accuracy of imputation is 
97% for iBLUP and 92% for BEAGLE. In order to make a 
comparison of imputation accuracy between iBLUP and BEA- 
GLE, known genotypes were masked as missing at four other 
different rates: 50%, 60%, 70% and 80%. The imputation 
accuracy decreased with the increase of missing rates for both 
methods. The iBLUP method outperformed over BEAGLE at all 
range of missing rates for the real sequencing data (Table 4). 

Discussion 

Missing genotype imputation is a critical process between 
sequencing and utilization for GWAS and genomic prediction 
[29-31]. Imputation accuracy relies on how well LD and IBD 
information are incorporated. IBD information is widely used in 
population and quantitative genetics. It is traditionally calculated 
from pedigree [32]. An alternative way to estimating IBD 
coefficients is from genetic markers [33]. This marker-based IBD 
more accurately specifies the actual difference between individuals 
and is superior to the pedigree kinship for genome-wide 
association studies [34]. The similar advantage was brought to 
genotype imputation in this study. 

The accuracy improvement of iBLUP also relate to the 
optimization to determine the LD block. We demonstrate that it 
was critical to have an appropriate LD block for imputation. Too 
broad or too narrow LD blocks would lead to the information 
dilution (Figure S2). The best LD block size can be determined 
by the optimization process in iBLUP. The suggested LD 
threshold of 0.1 can be used to save computing time or a starting 
value of optimization. 

The tolerance of iBLUP to high missing rate makes it possible to 
gain markers at high density. Take the pig data for example, 
haplotype blocks are about 10 kb within pig breed [35]. We need 
to identify markers that cover around 300,000 genomic locations 
at least for the GWAS or GS studies (one SNP per haplotype 
block). However, the commonly used pig DNA chip (Porci- 
neSNP60) only contains 60,000 SNPs [36]. In the present pig 
sequencing experiment, we only use one lane of flow cell for 72- 
plex. After imputation of 36% of missing genotypes, we gained 
more than 403,928 SNPs, which has much better coverage than 
the commonly used chip. 

One of the limitations of our proposed iBLUP is the computing 
speed for large sample size. When the sample size is medium (< 
300), the computing speed of iBLUP can compare with BEAGLE. 
Take the real 72 pig sequencing data (403,928 SNPs) for example, 
it takes about 20 minutes to perform imputation for both iBLUP 
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Supporting Information 

Figure SI The scheme of sampling Individuals. The top panel 
(a) is the complete pedigree of the 1 5 th QTLMAS workshop data 
[27]with 20 sires. Each Sire (S) mated with 10 Dams (D). Each 
dam produced 15 Progeny (P). All individuals are named 
randomly with sequential number. The first subscript indicates 
sire, the second indicates dam the third indicates progeny. The 
total numbers of individuals within each category are labeled on 
the fight column. The middle panel (b) keeps all the sires and 
dams. The difference (highlighted in red) is that each sire-dam 
family keeps only the first two progeny. This scheme has more 
families (all) and less progeny within family. As half sib is the major 
relationship among individuals, this scheme is named half sib 
scheme. The bottom panel (c) keeps the first 5 sires and their mates 
from panel a. Each sire-dam family keeps eleven progeny. This 
scheme has fewer families but more progeny within family. As full 
sib is the major relationship among individuals, this scheme is 
named full sib scheme. 
(TIF) 

Figure S2 Impact of linkage disequilibrium threshold. The 
Linkage Disequilibrium (LD) was calculated as the squared 
correlation coefficient. The adjacent markers with LD above the 
threshold were considered as a LD block to perform imputation. 
The evaluation was performed on subset of the 15 th QTLMAS 
common dataset by using the half-sib sampling scheme described 
in Figure SI. A total of 100 replications were conducted and the 
imputation accuracy is the average of 100 replications. 
(TIF) 

Table SI Additional information of the 72 pigs that were 

sequenced. 
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